Abstract

We introduce a statistical data model and an associated 
optimization-based clustering algorithm which allows data vectors 
to belong to zero, one or several “parent” clusters. For each data 
vector the algorithm makes a discrete decision among these 
alternatives. Thus, a recursive version of this algorithm would 
place data clusters in a Directed Acyclic Graph rather than a tree. 
We test the algorithm with synthetic data generated according to 
the statistical data model. We also illustrate the algorithm using 
real data from large-scale gene expression assays. 


1 Introduction 

Clustering algorithms traditionally construct clusters which are related by 
placement in a tree (hierarchical clustering) or embedding in a low-dimensional 
space (self-organizing maps). We seek to generalize the deterministic annealing 
approach to clustering under mixture models [1][2][3] so that when used recursively 
it can construct a Directed Acyclic Graph (DAG) of clusters rather than a tree. As a 
key development to this end, we consider here the recursively applicable step of 
clustering a set of feature vectors into groups so that each vector belongs to zero, 
one, or two clusters. Thus each data vector can have multiple “parent” clusters. We 
report on a generative model for such data using stochastic parameterized 
grammars, derive an appropriate constrained optimization problem for inferring 
parent clusters from data, and define and test a suitable optimization algorithm for 
this problem. 


2 Theory 


2.1 Data Model 

A generative, statistical data model is defined using stochastic parameterized 
grammars [4] as follows. A stochastic grammar has a “start” symbol, and a unique 
rule which transforms this symbol into a level-zero “cluster” symbol with numerical 
parameters including a mean and (possibly diagonal or scalar) covariance which are 
specified by the grammar rather than generated by a probability distribution. This 
level-zero cluster can serve as the left hand side for several different rules. One rule 
simply destroys the cluster. Another permits it to create data vectors according to 
its mean and covariance; these model “distractor” data which do not belong to any 
actual (level-one) cluster. Finally, a level-zero cluster may generate a level-one 
cluster whose level-one mean is determined by drawing from the level-zero 
Gaussian distribution, and whose level-one covariance is specified by the grammar. 
(Alternatively one may use a prior such as a cut-off inverse power law for diagonal 
covariance entries.) The level-zero cluster survives this rule-firing event and can 
participate in further rule firings. For spherical Gaussians, this rule may be 
summarized as: 

$$\mathrm{Cluster0}(y, \sigma, c) \rightarrow \mathrm{Cluster0}(y, \sigma, c + 1),\ \mathrm{Cluster1}(y', \sigma', c, k)$$

$$E = \frac{1}{2\sigma^{2}} |y - y'|^{2} - \ldots$$

Here y and y' are mean feature vectors, and E is an energy function whose Boltzmann 
probability distribution contributes to the stochastic behavior of the grammar as 
described below. Based on the relative probabilities of these three rules in the 
stochastic grammar, the level-zero cluster generates some number of level-one 
parameterized clusters and distractors, and then dies. 

The level-one clusters also can serve as the left-hand-side of several rules in the 
stochastic parameterized grammar. One rule kills the cluster. One rule allows it to 
generate a data vector (interpreted as a real cluster member, not a distractor) 
according to a Gaussian using the cluster’s mean and covariance; the cluster symbol 
survives as well. And one special rule takes two clusters on its left-hand-side and 
generates a single data vector by a suitably weighted average of the parent mean and 
covariance parameters. Both parents survive the event. This rule is the origin of 
multiple parentage in the data model. For scalar covariance (spherical Gaussians), 
this rule may be summarized as: 

$$\mathrm{Cluster1}(y, \sigma, c, k),\ \mathrm{Cluster1}(y', \sigma', c', k') \rightarrow$$

$$\mathrm{Cluster1}(y, \sigma, c, k + 1),\ \mathrm{Cluster1}(y', \sigma', c', k' + 1),\ \mathrm{Datum}(x, (c, k + 1), (c', k' + 1))$$

Finally, as discussed for a previous single-parent stochastic grammar [1], a global 
permutation removes all identifying indices ( c,k ) from all the generated data vectors. 
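
The generative process described above can be sketched in a few lines of Python. This is a simplified illustration, not the grammar itself: the numbers of distractors, single-parent data and two-parent data are fixed arguments rather than the outcome of stochastic rule firings, all Gaussians are spherical, and the function name `sample_multiparent_data` and its parameters are hypothetical. The parent labels are returned only so that a clustering can be scored afterwards; the final shuffle plays the role of the global permutation.

```python
import random

def sample_multiparent_data(n_clusters=2, n_single=5, n_double=5,
                            n_distractors=3, dim=2,
                            sigma0=10.0, sigma1=1.0, seed=0):
    """Draw data from a simplified two-level multi-parent model."""
    rng = random.Random(seed)
    root_mean = [0.0] * dim  # level-zero mean, specified rather than sampled
    # Level-one means are drawn from the level-zero Gaussian.
    means = [[rng.gauss(m, sigma0) for m in root_mean]
             for _ in range(n_clusters)]
    data = []  # list of (vector, parents) pairs
    # Distractors come directly from the level-zero cluster (no parent).
    for _ in range(n_distractors):
        data.append(([rng.gauss(m, sigma0) for m in root_mean], ()))
    # Single-parent data come from one level-one cluster.
    for _ in range(n_single):
        a = rng.randrange(n_clusters)
        data.append(([rng.gauss(m, sigma1) for m in means[a]], (a,)))
    # Two-parent data are centered on the average of the two parent means.
    for _ in range(n_double):
        a, b = rng.sample(range(n_clusters), 2)
        mid = [(p + q) / 2 for p, q in zip(means[a], means[b])]
        data.append(([rng.gauss(m, sigma1) for m in mid],
                     tuple(sorted((a, b)))))
    rng.shuffle(data)  # global permutation of the generated data
    return means, data
```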

Each rule in the grammar has an energy function which induces an unnormalized 
Boltzmann probability factor exp (-E/T). By analogy with statistical mechanics, 
we take the probability of an entire derivation d through the grammar to be the 
product of the Boltzmann factors for all the rules that fired in the derivation, 
normalized by the partition function which is the sum of all such products: 

$$\Pr(d) = \exp\Big( -\frac{1}{T} \sum_{r \in d} E_r \Big) \Big/ Z$$

$$Z = \sum_{d} \exp\Big( -\frac{1}{T} \sum_{r \in d} E_r \Big)$$

From this distribution one can derive Bayesian inference algorithms for the cluster 
means and covariances given data generated by the stochastic parameterized 
grammar. 
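
As a small numerical illustration of this normalization, the sketch below computes derivation probabilities for a finite, hand-written list of derivations, each given as the list of energies of the rules that fired. A real grammar generally has an unbounded set of derivations, so the finite enumeration, the energy values and the function name `derivation_probabilities` are assumptions made only for illustration.

```python
import math

def derivation_probabilities(derivations, T=1.0):
    """Normalized Boltzmann probabilities for a finite set of derivations.

    derivations: list of lists; each inner list holds the energies E_r of
    the rules that fired in one derivation.
    """
    # Product of per-rule Boltzmann factors = exp(-(sum of energies) / T).
    weights = [math.exp(-sum(energies) / T) for energies in derivations]
    Z = sum(weights)  # partition function over the enumerated derivations
    return [w / Z for w in weights]

# Three hypothetical derivations with total energies 1.5, 0.2 and 2.4.
probs = derivation_probabilities([[0.5, 1.0], [0.2], [2.0, 0.1, 0.3]])
```

The derivation with the lowest total energy receives the highest probability, and raising T flattens the distribution.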


2.2 Objective function 


If we let i index the observed data vectors and $\alpha$ index the clusters, then we can 
record which data vectors were generated by which parents using two arrays N with 
different numbers of indices: 

$N_{i\alpha} = 1$ if i was generated by $\alpha$ alone, zero otherwise 
$N_{i\alpha\beta} = 1$ if i was generated by $\alpha$ and $\beta$, zero otherwise. 


The simplest objective function for inferring N and the cluster means y from data x 
which we can derive from the grammar as outlined above using the methods of 
[4] [5] is: 


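$$E(N, y) = \sum_{i\alpha} N_{i\alpha} \Big[ \frac{1}{2\sigma^{2}} |x_i - y_\alpha|^{2} - \mu \Big] + \sum_{i\alpha\beta} N_{i\alpha\beta} \Big[ \frac{1}{2\sigma^{2}} \big| x_i - (y_\alpha + y_\beta)/2 \big|^{2} - \nu \Big]$$


with constraints 


$$N_{i\alpha} \in \{0, 1\}, \qquad N_{i\alpha\beta} \in \{0, 1\}.$$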
We can reduce the number of variables to solve for by noticing that a change of 
variables is possible: 

Define $M_{i\alpha} = 1$ if i was generated by $\alpha$ alone or in concert with another cluster, 
zero otherwise. Then 


$$N_{i\alpha} = M_{i\alpha} (2 - S_i)$$

$$S_i = \sum_{\alpha} M_{i\alpha}$$

The objective function becomes 

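$$E = \sum_{i\alpha} M_{i\alpha} \Big[ \frac{1}{2\sigma^{2}} |x_i - y_\alpha|^{2} - \mu \Big] - (\nu - \mu) \sum_{i} S_i (S_i - 1) - \frac{1}{2\sigma^{2}} \sum_{i\alpha\beta} M_{i\alpha} M_{i\beta} \big| (y_\alpha - y_\beta)/2 \big|^{2} \qquad (1)$$

with constraints 


$$M_{i\alpha} \in \{0, 1\}, \qquad S_i \le 2.$$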
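
The change of variables can be checked numerically. The sketch below evaluates the objective once in terms of parent assignments (the N form) and once in terms of memberships (the M form) and compares the results. It rests on assumptions that are not spelled out in the text: a common prefactor 1/(2σ²) on the squared distances, and a two-parent sum that runs over ordered pairs (α, β) with α ≠ β, so that each two-parent datum is counted twice. The function names and the example data are hypothetical.

```python
def sqdist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def energy_n_form(x, y, parents, mu, nu, sigma=1.0):
    """Objective written with parent indicators (N form).

    parents[i] is a tuple holding 0, 1 or 2 distinct cluster indices.
    """
    c = 1.0 / (2.0 * sigma ** 2)
    E = 0.0
    for xi, p in zip(x, parents):
        if len(p) == 1:
            E += c * sqdist(xi, y[p[0]]) - mu
        elif len(p) == 2:
            mid = [(a + b) / 2 for a, b in zip(y[p[0]], y[p[1]])]
            # Ordered pairs (a, b) and (b, a) both contribute.
            E += 2 * (c * sqdist(xi, mid) - nu)
    return E

def energy_m_form(x, y, parents, mu, nu, sigma=1.0):
    """Same objective after the change of variables (M form)."""
    c = 1.0 / (2.0 * sigma ** 2)
    E = 0.0
    for xi, p in zip(x, parents):
        S = len(p)  # S_i = number of clusters datum i belongs to
        E += sum(c * sqdist(xi, y[a]) - mu for a in p)
        E -= (nu - mu) * S * (S - 1)
        E -= c * sum(sqdist(y[a], y[b]) / 4 for a in p for b in p)
    return E

x = [(0.5, -0.2), (9.0, 11.0), (5.5, 4.0), (30.0, 30.0)]
y = [(0.0, 0.0), (10.0, 10.0)]
parents = [(0,), (1,), (0, 1), ()]
e_n = energy_n_form(x, y, parents, mu=1.5, nu=2.0)
e_m = energy_m_form(x, y, parents, mu=1.5, nu=2.0)
```

The agreement follows from the identity |x − (a + b)/2|² = ½|x − a|² + ½|x − b|² − ¼|a − b|², which is what turns the two-parent distance into single-parent distances plus a term quadratic in M.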
A similar change of variable applies for data generated by up to three parents: 

f, r -S^M la M lr - 5frM ia M :e 

-w* + w» 

$$S_i = \sum_{\alpha} M_{i\alpha}$$

We turn now to the construction of iterative algorithms for optimizing under these 
objective functions and constraints. 

2.3 Algorithm 

To perform the constrained optimization, we may consider multi-parent clustering 
as a modification of the existing soft-max style clustering in which the WTA 
(winner-take-all) or WMTA (winner might take all) constraint is replaced with an 
n-winners constraint by means of a dual encoding of a membership M and its complement 
$\bar{M} = 1 - M$. If $\alpha$ indexes the clusters and i indexes the data, then 

$$\sum_{\alpha} M_{\alpha i} \le n$$

which can be implemented via 

$$M_{\alpha i} + \bar{M}_{\alpha i} = 1$$

$$\sum_{\alpha} M_{\alpha i} + s_i = n$$

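These constraints can be translated into an alternative soft-max objective function 
analogous to a Mean Field Theory Potts glass effective energy: 

$$E = \sum_{\alpha i} M_{\alpha i} (D_{\alpha i} - \mu) + T \sum_{\alpha i} M_{\alpha i} (\log M_{\alpha i} - 1) + T \sum_{\alpha i} \bar{M}_{\alpha i} (\log \bar{M}_{\alpha i} - 1)$$

$$\quad + T \sum_{i} s_i (\log s_i - 1) + \sum_{\alpha i} \lambda_{\alpha i} (M_{\alpha i} + \bar{M}_{\alpha i} - 1) + \sum_{i} \kappa_i \Big( \sum_{\alpha} M_{\alpha i} + s_i - n \Big)$$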

Here D includes the distance metric in the first term of the objective function (1), 
and can also locally reflect the quadratic terms in (1) in an iterative algorithm as is 
done in the soft-assign approach to quadratic assignment optimization [3][7]. 
Taking derivatives of E with respect to each type of variable, and initializing the 
Lagrange multipliers to zero, we can derive aggressive update dynamics (large 
descent steps) similar to the soft-assign algorithm [6][3]: 

$$M^{(0)}_{\alpha i} = \exp[(\mu - D_{\alpha i})/T];$$

$$\bar{M}^{(0)}_{\alpha i} = 1;$$

$$s^{(0)}_i = 1;$$


and then iteratively, 

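$$M_{\alpha i} \leftarrow n\, M_{\alpha i} \Big/ \Big( s_i + \sum_{\beta} M_{\beta i} \Big);$$

$$s_i \leftarrow n\, s_i \Big/ \Big( s_i + \sum_{\beta} M_{\beta i} \Big);$$

$$M_{\alpha i} \leftarrow M_{\alpha i} / (M_{\alpha i} + \bar{M}_{\alpha i});$$

$$\bar{M}_{\alpha i} \leftarrow \bar{M}_{\alpha i} / (M_{\alpha i} + \bar{M}_{\alpha i});$$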
Then the cluster means are updated and the above steps iterated, with occasional 
decreases in temperature according to a fixed (e.g. exponential) schedule. If this 
algorithm failed to converge it would be possible to back off and do the gradient 
descent steps more slowly than the constraint-enforcement (ascent-like) steps. 
However our numerical experiments show good convergence in the present context. 
Convergence theory for soft-assign is dealt with in [7][8]. 
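
A minimal sketch of the constraint-enforcement part of such an iteration is given below, under stated assumptions. It takes the update to be an alternating rescaling, in the style of soft-assign, that enforces first "memberships plus slack sum to n" for each datum and then "membership plus complement equals 1" for each cluster-datum pair. The cluster-mean update and the temperature schedule are omitted, and the function name `soft_n_winners`, the fixed number of sweeps and the example costs are hypothetical.

```python
import math

def soft_n_winners(D, mu, T, n=2, sweeps=50):
    """Soft memberships with at most n winners per datum.

    D[a][i] is the cost of assigning datum i to cluster a.
    Returns M with M[a][i] in [0, 1].
    """
    A, I = len(D), len(D[0])
    # Initialization with all Lagrange multipliers set to zero.
    M = [[math.exp((mu - D[a][i]) / T) for i in range(I)] for a in range(A)]
    Mbar = [[1.0] * I for _ in range(A)]
    s = [1.0] * I
    for _ in range(sweeps):
        # Rescale so that sum_a M[a][i] + s[i] = n for every datum i.
        for i in range(I):
            scale = n / (s[i] + sum(M[a][i] for a in range(A)))
            s[i] *= scale
            for a in range(A):
                M[a][i] *= scale
        # Rescale so that M[a][i] + Mbar[a][i] = 1 for every pair.
        for a in range(A):
            for i in range(I):
                total = M[a][i] + Mbar[a][i]
                M[a][i] /= total
                Mbar[a][i] /= total
    return M

# Datum 0 is close to cluster 0 only, datum 1 is close to both clusters,
# and datum 2 is far from both.
D = [[0.1, 0.5, 50.0],
     [50.0, 0.5, 50.0]]
M = soft_n_winners(D, mu=2.0, T=1.0)
```

In this toy example the first datum ends up with a high membership in cluster 0 only, the second with high memberships in both clusters, and the third with memberships near zero, with the slack variable absorbing the remainder.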

The number of clusters could be varied with repeated runs, e.g. so as to produce a 
reasonably small average data-to-cluster-center distance without too many cluster 
centers and a relatively high likelihood for the overall clustering as measured by 
cross-validation [9]. 

3 Results 

3.1 Data 

To demonstrate the algorithm, we show an example using two clusters of data 
points generated from 2-D Gaussian distributions. One third of the points were 
generated from a zero mean Gaussian. Another third were generated from a 
Gaussian centered at (10,10). The final third were generated from a combination of 
the first two Gaussians. 
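
A data set of this shape could be produced as sketched below. The text does not give the variance, the sample size, or the exact form of the "combination", so unit variance, 100 points per group, and a Gaussian centered on the midpoint (5, 5) of the two means (as in the two-parent rule of the data model) are assumptions, and `make_synthetic` is a hypothetical name.

```python
import random

def make_synthetic(n_per_group=100, sigma=1.0, seed=0):
    """Two 2-D Gaussian clusters plus an equal-sized two-parent group."""
    rng = random.Random(seed)
    # Map each true parent set to the center its points are drawn around.
    centers = {
        (0,): (0.0, 0.0),      # first cluster, zero mean
        (1,): (10.0, 10.0),    # second cluster
        (0, 1): (5.0, 5.0),    # assumed midpoint for two-parent points
    }
    points, labels = [], []
    for parents, (cx, cy) in centers.items():
        for _ in range(n_per_group):
            points.append((rng.gauss(cx, sigma), rng.gauss(cy, sigma)))
            labels.append(parents)
    return points, labels
```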

3.2 Clustering 

The results shown are the estimated means and the probability of several types of 
errors. There are three types of errors that can occur: 

(1) A point that should be classified as coming from both clusters could be assigned 
to one or no clusters. 

(2) A point that comes from one cluster may be assigned to no clusters. 

(3) A point that comes from one cluster could be assigned to two clusters. 
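
The three error types can be tallied as in the sketch below, given the true parent set and the assigned cluster set of each point. Normalizing each count by the number of points of the relevant true type (two-parent points for the first error, single-parent points for the other two) is an assumption, as are the function name `error_rates` and the dictionary keys.

```python
def error_rates(true_parents, assigned):
    """Fractions of points showing each of the three error types.

    true_parents[i] and assigned[i] are sets of cluster indices.
    """
    n_two = sum(1 for t in true_parents if len(t) == 2)
    n_one = sum(1 for t in true_parents if len(t) == 1)
    two_to_fewer = one_to_none = one_to_two = 0
    for t, a in zip(true_parents, assigned):
        if len(t) == 2 and len(a) < 2:
            two_to_fewer += 1   # type (1): two-parent point under-assigned
        elif len(t) == 1 and len(a) == 0:
            one_to_none += 1    # type (2): single-parent point unassigned
        elif len(t) == 1 and len(a) == 2:
            one_to_two += 1     # type (3): single-parent point over-assigned
    return {
        "two_to_fewer": two_to_fewer / n_two if n_two else 0.0,
        "one_to_none": one_to_none / n_one if n_one else 0.0,
        "one_to_two": one_to_two / n_one if n_one else 0.0,
    }

truth = [{0, 1}, {0, 1}, {0}, {0}, {1}, {1}]
found = [{0}, {0, 1}, set(), {0}, {0, 1}, {1}]
rates = error_rates(truth, found)
```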


The probability of each of these types of error is plotted in Figure 1 as a function 
of the parameter $\mu$, which represents a reward for being assigned to a cluster. In 
these experiments we have taken $\nu = \mu$. From the figure, it can be seen that when 
the reward for joining a cluster is too small, the points that should belong to only 
one cluster tend to be assigned to no clusters. As the reward for joining a cluster 
gets larger, all points are assigned to two clusters. 



Figure 1. Three types of error as a function of $\mu$. 

In Figure 2, we show the estimated means for the two clusters as a function of the 
parameter $\mu$. The average value of the first coordinate for each cluster is plotted. 
For $\mu$ too large or too small the means tend toward each other and the joint mean 
of the entire data set. There is an intermediate window of successful operation. 



Figure 2. Cluster means (first coordinate) as a function of $\mu$. 

We are currently performing a similar parameter exploration for higher-dimensional 
feature vectors and more classes. So far we observe a reasonable rate of successful 
multiparent clustering runs for 15 clusters in 10 dimensions. 

In addition, we have also run the algorithm on real biological data consisting of 
1244 feature vectors, each truncated to 5 dimensions, representing log ratios of 
mRNA gene expression measurements from Stuart Kim’s laboratory on the 
nematode worm C. elegans. We used 15 clusters, and varied $\mu$. Depending on the 
value of $\mu$, we observe varying fractions of genes falling into the slack class, 
having single parent clusters, and having two parent clusters. 


4 Discussion 

We have introduced a statistical data model and an optimization algorithm for 
analysing clustered data in which a data vector can belong to zero, one, or several 
clusters. This “multiparent clustering” algorithm, applied recursively, would create 
a Directed Acyclic Graph rather than a tree of hierarchical cluster centers. For 
many applications in data analysis, visualization and information retrieval the DAG 
is a more reasonable or flexible structure to infer. We demonstrated the algorithm 
using synthetic data generated according to the multi-parent clustering statistical 
data model. 